my_matrix <- matrix(1:12, nrow = 3)
dim(my_matrix)

Lecture 5: Rectangular data
2024-10-17
Understand that computer code and data are stored as text files
Understand how we import data from text files
Learn data structures in R
Exercise: read financial data from a text file -> today
We distinguish two basic characteristics:
R-specific)
Source: http://venus.ifca.unican.es/Rintro/dataStruct.html
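A minimal sketch of the distinction, assuming the two characteristics meant here are data types (what kind of value is stored) and data structures (how values are arranged). All object names are illustrative only.

```r
# Data types: what kind of value is stored
typeof(24L)      # "integer"
typeof(24.5)     # "double"
typeof("John")   # "character"
typeof(TRUE)     # "logical"

# Data structures: how values are arranged
ages   <- c(24, 29, 31)              # atomic vector: one type only
mixed  <- c(24, "John", TRUE)        # mixed input is coerced to one common type
record <- list(24, "John", TRUE)     # list: each element keeps its own type
people <- data.frame(                # rectangular: columns of equal length
  name = c("John", "Anna"),
  age_in_years = c(24, 29)
)

typeof(mixed)            # the common type chosen for the mixed vector
sapply(record, typeof)   # one type per list element
str(people)              # structure and column types of the data frame
```

The atomic vector has to settle on one type for all elements, while the list and the data frame can hold different types side by side.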
00000000: efbb bf6e 616d 652c 6167 655f 696e 5f79 ...name,age_in_y
00000010: 6561 7273 0d0a 4a6f 686e 2c32 340d 0a41 ears..John,24..A
00000020: 6e6e 612c 3239 0d0a 4265 6e2c 3331 0d0a nna,29..Ben,31..
00000030: 4c69 7a2c 3334 0d0a 4d61 782c 3237 Liz,34..Max,27
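The hex dump can be reproduced from within R. The following is a sketch under assumptions: the file is written to a temporary path (not the file used in the lecture), and the content is rebuilt by hand from the dump, including the UTF-8 byte order mark (ef bb bf) and the Windows line endings (0d 0a).

```r
# Rebuild the CSV content shown in the hex dump (CRLF line endings)
csv_text <- "name,age_in_years\r\nJohn,24\r\nAnna,29\r\nBen,31\r\nLiz,34\r\nMax,27"

# Write the byte order mark plus the text as raw bytes to a temporary file
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xef, 0xbb, 0xbf)), con)  # UTF-8 byte order mark
writeBin(charToRaw(csv_text), con)
close(con)

# Read the file back byte by byte instead of as text
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
head(bytes, 16)           # first row of the hex dump
rawToChar(bytes[-(1:3)])  # the same bytes interpreted as text, without the BOM
```

Every character of the CSV, including the commas and line breaks, is just a byte in the file. Importing the data means interpreting those bytes as text and then as a table.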
R creates a matrix of dimension 3, 4
my_matrix[2, 1] == "2" gives the solution TRUE
R must coerce the data to a common type to accommodate all different values
mean(my_matrix[,1]) == 2.5 returns 2.5

Today
Excel spreadsheets (.xls)
Formats of other statistical software (SPSS: .sav, STATA: .dta, etc.)
“The tidyverse is a collection of open source packages for the R programming language introduced by Hadley Wickham and his team that ‘share an underlying design philosophy, grammar, and data structures’ of tidy data.” (Wikipedia)
Source: https://www.storybench.org/wp-content/uploads/2017/05/tidyverse.png
Consider the swiss dataset stored in a CSV file:
"District","Fertility","Agriculture","Examination","Education","Catholic","Infant.Mortality"
"Courtelary",80.2,17,15,12,9.96,22.2
read.csv() (basic R distribution)
Returns a data.frame.
read_csv() (readr package, part of the tidyverse)
Returns a tibble.
The data.table package and its fread() function (for handling large datasets).
readr functions
Parse the first lines of the swiss dataset directly like this…
library(readr)
read_csv('"District","Fertility","Agriculture","Examination","Education","Catholic","Infant.Mortality"
"Courtelary",80.2,17,15,12,9.96,22.2')
# A tibble: 1 × 7
District Fertility Agriculture Examination Education Catholic Infant.Mortality
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Courtelary 80.2 17 15 12 9.96 22.2
or read the entire swiss dataset by pointing to the file
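A sketch of the file-based variant. The CSV used in the lecture is not part of these notes, so this example first writes R's built-in swiss dataset to a temporary file (a hypothetical stand-in for the lecture file) and then reads it back. The argument show_col_types = FALSE assumes readr 2.0 or later and only silences the column summary.

```r
library(readr)

# Build a stand-in for the lecture file: move the row names of the
# built-in swiss dataset into a District column and write it to disk
swiss_df <- data.frame(District = rownames(swiss), swiss, row.names = NULL)
path <- tempfile(fileext = ".csv")  # hypothetical location, not the lecture path
write_csv(swiss_df, path)

# Read the entire dataset by pointing read_csv() to the file
swiss_tbl <- read_csv(path, show_col_types = FALSE)
swiss_tbl
```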
In either case, the result is a tibble:
# A tibble: 47 × 7
District Fertility Agriculture Examination Education Catholic Infant.Mortality
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Courtelary 80.2 17 15 12 9.96 22.2
2 Delemont 83.1 45.1 6 9 84.8 22.2
3 Franches-Mnt 92.5 39.7 5 5 93.4 20.2
4 Moutier 85.8 36.5 12 7 33.8 20.3
5 Neuveville 76.9 43.5 17 15 5.16 20.6
6 Porrentruy 76.1 35.3 9 7 90.6 26.6
7 Broye 83.8 70.2 16 7 92.8 23.6
8 Glane 92.4 67.8 14 8 97.2 24.9
9 Gruyere 82.4 53.3 12 7 97.7 21
10 Sarine 82.9 45.2 16 13 91.4 24.4
# ℹ 37 more rows
Other readr functions have practically the same syntax and behavior:
read_tsv() (tab-separated)
read_fwf() (fixed-width)
Recognizing columns and rows is one thing…
What does read_csv() recognize?
Data structures: data.frame/tibble, etc.
Data types: character, numeric, etc.
read_csv() guesses the data types.
"12:00": type character?
c("12:00", "midnight", "noon")?
c("12:00", "14:30", "20:01")?
# A tibble: 3 × 2
A B
<time> <chr>
1 12:00 12:00
2 14:30 midnight
3 20:01 noon
How can read_csv() distinguish the two cases?
Under the hood, read_csv() uses the guess_parser() function to determine which type each of the two vectors likely contains:
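A minimal sketch of that guess for the two example vectors, followed by a small inline CSV. The inline CSV is an assumption about how the two-column tibble might have been produced; the column names A and B are taken from the printed output.

```r
library(readr)

# Ask readr which parser it would pick for each vector
guess_parser(c("12:00", "midnight", "noon"))  # "character"
guess_parser(c("12:00", "14:30", "20:01"))    # "time"

# The same guess happens column by column when parsing a CSV
times <- read_csv("A,B\n12:00,12:00\n14:30,midnight\n20:01,noon",
                  show_col_types = FALSE)
times
```

A column is parsed as time only if every value in it fits the time format. A single value such as "midnight" makes readr fall back to character.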
Re-load the swiss dataset, or load the built-in dataset.
To load a built-in dataset, simply use the data() function.
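A short sketch using only base R. It assumes that the District column in the printouts corresponds to the row names of the built-in swiss dataset; the name swiss_df is illustrative.

```r
# Load the built-in dataset into the workspace
data(swiss)
class(swiss)  # a plain data.frame
dim(swiss)    # 47 rows, 6 columns (districts are stored as row names)

# Move the row names into a regular column to match the CSV layout
swiss_df <- data.frame(District = rownames(swiss), swiss, row.names = NULL)

# Inspect structure and first rows
str(swiss_df)
head(swiss_df, n = 5)
```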
Similar!
# A tibble: 47 × 7
District Fertility Agriculture Examination Education Catholic Infant.Mortality
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Courtelary 80.2 17 15 12 9.96 22.2
2 Delemont 83.1 45.1 6 9 84.8 22.2
3 Franches-Mnt 92.5 39.7 5 5 93.4 20.2
4 Moutier 85.8 36.5 12 7 33.8 20.3
5 Neuveville 76.9 43.5 17 15 5.16 20.6
6 Porrentruy 76.1 35.3 9 7 90.6 26.6
7 Broye 83.8 70.2 16 7 92.8 23.6
8 Glane 92.4 67.8 14 8 97.2 24.9
9 Gruyere 82.4 53.3 12 7 97.7 21
10 Sarine 82.9 45.2 16 13 91.4 24.4
# ℹ 37 more rows
District Fertility Agriculture Examination Education Catholic Infant.Mortality
1 Courtelary 80.2 17.0 15 12 9.96 22.2
2 Delemont 83.1 45.1 6 9 84.84 22.2
3 Franches-Mnt 92.5 39.7 5 5 93.40 20.2
4 Moutier 85.8 36.5 12 7 33.77 20.3
5 Neuveville 76.9 43.5 17 15 5.16 20.6
6 Porrentruy 76.1 35.3 9 7 90.57 26.6
7 Broye 83.8 70.2 16 7 92.85 23.6
8 Glane 92.4 67.8 14 8 97.16 24.9
9 Gruyere 82.4 53.3 12 7 97.67 21.0
10 Sarine 82.9 45.2 16 13 91.38 24.4
11 Veveyse 87.1 64.5 14 6 98.61 24.5
12 Aigle 64.1 62.0 21 12 8.52 16.5
13 Aubonne 66.9 67.5 14 7 2.27 19.1
14 Avenches 68.9 60.7 19 12 4.43 22.7
15 Cossonay 61.7 69.3 22 5 2.82 18.7
16 Echallens 68.3 72.6 18 2 24.20 21.2
17 Grandson 71.7 34.0 17 8 3.30 20.0
18 Lausanne 55.7 19.4 26 28 12.11 20.2
19 La Vallee 54.3 15.2 31 20 2.15 10.8
20 Lavaux 65.1 73.0 19 9 2.84 20.0
21 Morges 65.5 59.8 22 10 5.23 18.0
22 Moudon 65.0 55.1 14 3 4.52 22.4
23 Nyone 56.6 50.9 22 12 15.14 16.7
24 Orbe 57.4 54.1 20 6 4.20 15.3
25 Oron 72.5 71.2 12 1 2.40 21.0
26 Payerne 74.2 58.1 14 8 5.23 23.8
27 Paysd'enhaut 72.0 63.5 6 3 2.56 18.0
28 Rolle 60.5 60.8 16 10 7.72 16.3
29 Vevey 58.3 26.8 25 19 18.46 20.9
30 Yverdon 65.4 49.5 15 8 6.10 22.5
31 Conthey 75.5 85.9 3 2 99.71 15.1
32 Entremont 69.3 84.9 7 6 99.68 19.8
33 Herens 77.3 89.7 5 2 100.00 18.3
34 Martigwy 70.5 78.2 12 6 98.96 19.4
35 Monthey 79.4 64.9 7 3 98.22 20.2
36 St Maurice 65.0 75.9 9 9 99.06 17.8
37 Sierre 92.2 84.6 3 3 99.46 16.3
38 Sion 79.3 63.1 13 13 96.83 18.1
39 Boudry 70.4 38.4 26 12 5.62 20.3
40 La Chauxdfnd 65.7 7.7 29 11 13.79 20.5
41 Le Locle 72.7 16.7 22 13 11.22 18.9
42 Neuchatel 64.4 17.6 35 32 16.92 23.0
43 Val de Ruz 77.6 37.6 15 7 4.97 20.0
44 ValdeTravers 67.6 18.7 25 7 8.65 19.5
45 V. De Geneve 35.0 1.2 37 53 42.34 18.0
46 Rive Droite 44.7 46.6 16 29 50.43 18.2
47 Rive Gauche 42.8 27.7 22 29 58.33 19.3
'data.frame': 47 obs. of 7 variables:
$ District : chr "Courtelary" "Delemont" "Franches-Mnt" "Moutier" ...
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : num 15 6 5 12 17 9 16 14 12 16 ...
$ Education : num 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
District Fertility Agriculture Examination Education Catholic Infant.Mortality
1 Courtelary 80.2 17.0 15 12 9.96 22.2
2 Delemont 83.1 45.1 6 9 84.84 22.2
3 Franches-Mnt 92.5 39.7 5 5 93.40 20.2
4 Moutier 85.8 36.5 12 7 33.77 20.3
5 Neuveville 76.9 43.5 17 15 5.16 20.6
Importing from Excel requires the additional R package readxl. We then use the package’s read_excel() function to import data from an Excel sheet.
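A sketch of an Excel import, assuming that the packages readxl and openxlsx are installed. No Excel file ships with these notes, so the example first creates a small demo workbook in a temporary location (hypothetical path) and then imports it.

```r
library(readxl)
library(openxlsx)

# Create a small demo workbook from the first rows of the swiss dataset
demo_df <- data.frame(District = rownames(swiss), swiss, row.names = NULL)[1:5, ]
path <- tempfile(fileext = ".xlsx")  # hypothetical location
write.xlsx(demo_df, file = path)

# Import the first sheet of the workbook; the result is a tibble
swiss_xlsx <- read_excel(path, sheet = 1)
swiss_xlsx
```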
The openxlsx package can write, style, and edit .xlsx files.
STATA, SPSS, etc.
Additional packages needed:
foreign
haven
Parsers (functions) for many foreign formats.
read_spss() for SPSS’ .sav format.
Tell your future self what this script is all about 🤓🔮💻
#######################################################################
# Data Handling Course: Example Script for Data Gathering and Import
#
# Imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
#
# A. Sallin, St. Gallen, 2024
#######################################################################
Use ---------- to indicate the beginning of sections.
# SET UP --------------
# load packages
library(tidyverse)
# set fixed variables
INPUT_PATH <- "/rawdata"
OUTPUT_FILE <- "/final_data/datafile.csv"
# IMPORT RAW DATA FROM CSVs -------------
# End -------------
Let’s code!
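As a possible starting point, here is a sketch of how the import section of such a script could be filled in. Everything specific is an assumption: the folder and file names are hypothetical temporary locations (not the INPUT_PATH and OUTPUT_FILE of the skeleton), and two tiny demo CSVs are created first so that the example runs on its own.

```r
# SET UP --------------
library(readr)

# hypothetical demo locations inside the temporary directory
input_path  <- file.path(tempdir(), "rawdata_demo")
output_file <- file.path(tempdir(), "datafile_demo.csv")
dir.create(input_path, showWarnings = FALSE)

# create two small demo input files (stand-ins for real raw data)
write_csv(data.frame(name = c("John", "Anna"), age_in_years = c(24, 29)),
          file.path(input_path, "part1.csv"))
write_csv(data.frame(name = c("Ben", "Liz", "Max"), age_in_years = c(31, 34, 27)),
          file.path(input_path, "part2.csv"))

# IMPORT RAW DATA FROM CSVs -------------
csv_files <- list.files(input_path, pattern = "\\.csv$", full.names = TRUE)
raw_list  <- lapply(csv_files, read_csv, show_col_types = FALSE)
raw_data  <- do.call(rbind, raw_list)  # stack the files row-wise

# WRITE OUTPUT -------------
write_csv(raw_data, output_file)

# End -------------
```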